NSF PAR Search | NSF Public Access Repository

Accurate short-read alignment through r-index-based pangenome indexing

https://doi.org/10.1101/gr.279858.124

Varki, Rahul; Rossi, Massimiliano; Ferro, Eddie; Oliva, Marco; Garrison, Erik; Langmead, Ben; Boucher, Christina (June 2025, Genome Research)

Aligning to a linear reference genome can result in a higher percentage of reads going unmapped or being incorrectly mapped owing to variations not captured by the reference, otherwise known as reference bias. Recently, in efforts to mitigate reference bias, there has been a movement to switch to using pangenomes, a collection of genomes, as the reference. In this paper, we introduce Moni-align, the first short-read pangenome aligner built on the r-index, a variation of the classical FM-index that can index collections of genomes in O(r)-space, where r is the number of runs in the Burrows–Wheeler transform. Moni-align uses a seed-and-extend strategy for aligning reads, utilizing maximal exact matches as seeds, which can be efficiently obtained with the r-index. Using both simulated and real short-read data sets, we demonstrate that Moni-align achieves alignment accuracy comparable to vg map and vg giraffe, the leading pangenome aligners. Although currently best suited for aligning to localized pangenomes owing to computational constraints, Moni-align offers a robust foundation for future optimizations that could further broaden its applicability.
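To make the quantity r in this abstract concrete, the following minimal Python sketch computes a Burrows–Wheeler transform naively and counts its runs. It is an illustration only, not Moni-align's implementation: the function names `bwt` and `count_runs` are hypothetical, the sorted-rotations construction is quadratic, and `$` is assumed to be a terminator that does not occur in the input.

```python
def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform via sorted rotations (illustration only)."""
    if "$" in text:
        raise ValueError("'$' is reserved as the terminator in this sketch")
    s = text + "$"  # '$' sorts before the letters A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters, i.e. the r in O(r)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


transformed = bwt("banana")
print(transformed, count_runs(transformed))
```

On repetitive inputs such as collections of similar genomes, r tends to grow much more slowly than the total text length, which is why an index of size O(r) is attractive for pangenomes.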
Full Text Available
Pfp-fm: an accelerated FM-index

https://doi.org/10.1186/s13015-024-00260-8

Hong, Aaron; Oliva, Marco; Köppl, Dominik; Bannai, Hideo; Boucher, Christina; Gagie, Travis (December 2024, Algorithms for Molecular Biology)

FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing (which takes parameters that let us tune the average length of the phrases) instead of induced suffix sorting gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates the performance of count over all state-of-the-art methods with a moderate increase in memory. The source code for PFP-FM is available at https://github.com/AaronHong1024/afm.
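The "one random access per character" cost mentioned in this abstract can be seen in a minimal sketch of a standard character-based FM-index count query (backward search). This is not PFP-FM, which works on parsed phrases; the names `build_fm_index` and `count_occurrences` are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed for clarity.

```python
def build_fm_index(text: str):
    """Build the C array and occurrence table of a plain character-based FM-index."""
    s = text + "$"  # '$' is assumed absent from text and sorts first
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in suffix_array)
    alphabet = sorted(set(s))
    # C[c] = number of characters in s that are strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += s.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ, len(bwt)


def count_occurrences(pattern: str, C, occ, n: int) -> int:
    """Backward search: one pair of occ lookups per pattern character."""
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


C, occ, n = build_fm_index("banana")
print(count_occurrences("ana", C, occ, n))
```

Each loop iteration touches the occurrence table at two essentially unrelated positions, so a pattern of m characters costs on the order of m random accesses. A word-based index consumes several characters per step, which is the effect the abstract describes.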
Full Text Available
Building a pangenome alignment index via recursive prefix-free parsing

https://doi.org/10.1016/j.isci.2024.110933

Ferro, Eddie; Oliva, Marco; Gagie, Travis; Boucher, Christina (October 2024, iScience)

Full Text Available
An Average-Case Efficient Two-Stage Algorithm for Enumerating All Longest Common Substrings of Minimum Length k Between Genome Pairs

https://doi.org/10.1109/ICHI61247.2024.00020

Prosperi, Mattia; Marini, Simone; Boucher, Christina (June 2024, IEEE)

An extension of the longest common substring (LCS) problem between two texts is the enumeration of all LCSs given a minimum length k (ALCS-k), along with their positions in each text. In bioinformatics, an efficient solution to ALCS-k for very long texts (genomes or metagenomes) can provide useful insights to discover genetic signatures responsible for biological mechanisms. The ALCS-k problem has two additional requirements compared to the LCS problem: one is the minimum length k, and the other is that all common strings longer than k must be reported. We present an efficient, two-stage ALCS-k algorithm exploiting the spectrum of text substrings of length k (k-mers). Our approach yields a worst-case time complexity loglinear in the number of k-mers for the first stage, and an average-case loglinear in the number of common k-mers for the second stage (several orders of magnitude smaller than the total k-mer spectrum). The space complexity is linear in the first phase (disk-based), and on average linear in the second phase (disk- and memory-based). Tests performed on genomes for different organisms (including viruses, bacteria and animal chromosomes) show that run times are consistent with our theoretical estimates; further, comparisons with MUMmer4 show an asymptotic advantage with divergent genomes.
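As a rough illustration of the problem statement (not the two-stage, disk-based algorithm of the paper), the sketch below enumerates maximal common substrings of length at least k between two strings by seeding on shared k-mers and extending to the right. The function name `maximal_common_substrings` is hypothetical, and everything is held in memory, so it only suits short inputs.

```python
from collections import defaultdict


def maximal_common_substrings(a: str, b: str, k: int):
    """Return sorted (pos_in_a, pos_in_b, length) for maximal matches of length >= k."""
    # Index every k-mer of a by its starting positions.
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)

    results = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            # If the match extends to the left, it was already reported
            # from an earlier seed, so skip it here.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            length = k
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            results.append((i, j, length))
    return sorted(results)


print(maximal_common_substrings("ACGTACGT", "TTACGTAA", 3))
```

Only seeds that cannot be extended to the left are expanded, so each maximal match is reported once together with its position in both texts.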
Full Text Available
A study at the wildlife-livestock interface unveils the potential of feral swine as a reservoir for extended-spectrum β-lactamase-producing Escherichia coli

https://doi.org/10.1016/j.jhazmat.2024.134694

Liu, Ting; Lee, Shinyoung; Kim, Miju; Fan, Peixin; Boughton, Raoul K; Boucher, Christina; Jeong, Kwangcheol C (July 2024, Journal of Hazardous Materials)

Full Text Available
Solving the Minimal Positional Substring Cover Problem in Sublinear Space

https://doi.org/10.4230/LIPIcs.CPM.2024.12

Bonizzoni, Paola; Boucher, Christina; Cozzi, Davide; Gagie, Travis; Pirola, Yuri (January 2024, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Inenaga, Shunsuke; Puglisi, Simon J (Ed.)
Within the field of haplotype analysis, the Positional Burrows-Wheeler Transform (PBWT) stands out as a key innovation, addressing numerous challenges in genomics. For example, Sanaullah et al. introduced a PBWT-based method that addresses the haplotype threading problem, which involves representing a query haplotype through a minimal set of substrings. To solve this problem using the PBWT data structure, they formulate the Minimal Positional Substring Cover (MPSC) problem and then present a solution for it. Additionally, they present and solve several variants of this problem: k-MPSC, leftmost MPSC, rightmost MPSC, and length-maximal MPSC. Yet each of their solutions requires a full PBWT, which results in significant memory usage. Here, we take advantage of the latest results on run-length encoding the PBWT to solve the MPSC in a sublinear amount of space. Our methods involve demonstrating that k-Set Maximal Exact Matches (k-SMEMs) can be computed in a sublinear amount of space via efficient computation of k-Matching Statistics (k-MS). This leads to a solution that requires sublinear space not only for the MPSC problem but for all of its variations proposed by Sanaullah et al. Most importantly, we present experimental results on haplotype panels from the 1000 Genomes Project data that show the utility of these theoretical results. We conclusively demonstrate that our approach markedly decreases the memory required to solve the MPSC problem, achieving a reduction of at least two orders of magnitude compared to the method proposed by Sanaullah et al. This efficiency allows us to solve large instances of the problem to which other methods are unable to scale. In summary, the creation of μ-PBWT paves the way for new possibilities in conducting in-depth genetic research and analysis on a large scale. All source code is publicly available at https://github.com/dlcgold/muPBWT/tree/k-smem.
Full Text Available
ONeSAMP 3.0: estimation of effective population size via single nucleotide polymorphism data from one population

https://doi.org/10.1093/g3journal/jkae153

Hong, Aaron; Cheek, Rebecca G; De Silva, Suhashi Nihara; Mukherjee, Kingshuk; Yooseph, Isha; Oliva, Marco; Heim, Mark; W Funk, Chris; Tallmon, David; Boucher, Christina (July 2024, G3: Genes, Genomes, Genetics)
Myers, C (Ed.)
The genetic effective size (Ne) is arguably one of the most important characteristics of a population as it impacts the rate of loss of genetic diversity. Methods that estimate Ne are important in population and conservation genetic studies as they quantify the risk of a population being inbred or lacking genetic diversity. Yet there are very few methods that can estimate the Ne from data from a single population and without extensive information about the genetics of the population, such as a linkage map, or a reference genome of the species of interest. We present ONeSAMP 3.0, an algorithm for estimating Ne from single nucleotide polymorphism data collected from a single population sample using approximate Bayesian computation and local linear regression. We demonstrate the utility of this approach using simulated Wright–Fisher populations, and empirical data from five endangered Channel Island fox (Urocyon littoralis) populations to evaluate the performance of ONeSAMP 3.0 compared to a commonly used Ne estimator. Our results show that ONeSAMP 3.0 is broadly applicable to natural populations and is flexible enough that future versions could easily include summary statistics appropriate for a suite of biological and sampling conditions. ONeSAMP 3.0 is publicly available under the GNU General Public License at https://github.com/AaronHong1024/ONeSAMP_3.
Full Text Available
SPUMONI 2: improved classification using a pangenome index of minimizer digests

https://doi.org/10.1186/s13059-023-02958-1

Ahmed, Omar Y.; Rossi, Massimiliano; Gagie, Travis; Boucher, Christina; Langmead, Ben (December 2023, Genome Biology)

Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array. By incorporating minimizers, SPUMONI 2’s index is 65 times smaller than minimap2’s for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency in practical scenarios such as adaptive sampling, contamination detection and multi-class metagenomics classification.
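The general idea of reducing a sequence to its minimizers before indexing can be sketched as follows. This is an assumption-laden illustration rather than SPUMONI 2's scheme: the function name `minimizer_digest` is hypothetical, minimizers are chosen by lexicographic order (practical tools typically order k-mers by a hash), and the parameters k and w are arbitrary.

```python
def minimizer_digest(seq: str, k: int, w: int):
    """Pick the smallest k-mer in each window of w consecutive k-mers.

    A minimizer selected by several overlapping windows at the same
    position is emitted only once.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    last_pos = -1
    for start in range(len(kmers) - w + 1):
        # Break ties between equal k-mers by taking the leftmost position.
        pos = min(range(start, start + w), key=lambda i: (kmers[i], i))
        if pos != last_pos:
            picked.append(kmers[pos])
            last_pos = pos
    return picked


print(minimizer_digest("ACGTTGCA", k=3, w=3))
```

Because neighboring windows often share the same minimizer, the digest usually contains fewer k-mers than the original sequence, which is how minimizers can shrink an index.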
Full Text Available
Recursive Prefix-Free Parsing for Building Big BWTs

https://doi.org/10.1109/DCC55655.2023.00014

Oliva, Marco; Gagie, Travis; Boucher, Christina (March 2023, IEEE Data Compression Conference)

Full Text Available
The K-mer antibiotic resistance gene variant analyzer (KARGVA)

https://doi.org/10.3389/fmicb.2023.1060891

Marini, Simone; Boucher, Christina; Noyes, Noelle; Prosperi, Mattia (March 2023, Frontiers in Microbiology)

Characterization of antibiotic resistance genes (ARGs) from high-throughput sequencing data of metagenomics and cultured bacterial samples is a challenging task, with the need to account for both computational (e.g., string algorithms) and biological (e.g., gene transfers, rearrangements) aspects. Curated ARG databases exist together with assorted ARG classification approaches (e.g., database alignment, machine learning). Besides ARGs that naturally occur in bacterial strains or are acquired through mobile elements, there are chromosomal genes that can render a bacterium resistant to antibiotics through point mutations, i.e., ARG variants (ARGVs). While ARG repositories also collect ARGVs, there are only a few tools that are able to identify ARGVs from metagenomics and high-throughput sequencing data, with a number of limitations (e.g., pre-assembly, a posteriori verification of mutations, or specification of species). In this work we present the k-mer, i.e., strings of fixed length k, ARGV analyzer – KARGVA – an open-source, multi-platform tool that provides: (i) an ad hoc, large ARGV database derived from multiple sources; (ii) input capability for various types of high-throughput sequencing data; (iii) a three-way, hash-based, k-mer search setup to process data efficiently, linking k-mers to ARGVs, k-mers to point mutations, and ARGVs to k-mers, respectively; (iv) a statistical filter on sequence classification to reduce type I and II errors. On semi-synthetic data, KARGVA provides very high accuracy even in presence of high sequencing errors or mutations (99.2 and 86.6% accuracy within 1 and 5% base change rates, respectively), and genome rearrangements (98.2% accuracy), with robust performance on ad hoc false positive sets. On data from the worldwide MetaSUB consortium, comprising 3,700+ metagenomics experiments, KARGVA identifies more ARGVs than Resistance Gene Identifier (4.8x) and PointFinder (6.8x), yet all predictions are below the expected false positive estimates. The prevalence of ARGVs is correlated to ARGs but ecological characteristics do not explain ARGV variance well. KARGVA is publicly available at https://github.com/DataIntellSystLab/KARGVA under MIT license.
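A hash-based k-mer lookup of the kind this abstract mentions can be sketched in a few lines of Python. This is a simplified illustration, not KARGVA's code: the names `build_kmer_index` and `classify_read` are hypothetical, the reference sequences are toy data, only the k-mer-to-reference map is shown (not the point-mutation map), and the `min_fraction` threshold is a crude stand-in for the statistical filter described in the abstract.

```python
from collections import Counter, defaultdict


def build_kmer_index(references: dict, k: int):
    """Map each k-mer to the set of reference names that contain it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def classify_read(read: str, index, k: int, min_fraction: float = 0.5):
    """Assign a read to the reference hit by the most k-mers, or None."""
    total = len(read) - k + 1
    if total <= 0:
        return None
    hits = Counter()
    for i in range(total):
        for name in index.get(read[i:i + k], ()):
            hits[name] += 1
    if not hits:
        return None
    name, n_hits = hits.most_common(1)[0]
    # Hypothetical threshold standing in for a proper statistical filter.
    return name if n_hits / total >= min_fraction else None


refs = {"geneA": "ACGTACGTGG", "geneB": "TTTTCCCCAA"}  # toy sequences
index = build_kmer_index(refs, k=4)
print(classify_read("GTACGTG", index, k=4))
```

Dictionary lookups make each k-mer query constant time on average, so the cost of classifying a read grows linearly with its length.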
Full Text Available
